Abstract

The problems of model and variable selection for classification trees
are jointly considered. A penalized criterion is proposed which explicitly
takes into account the number of variables, and a risk bound inequality
is provided for the tree classifier minimizing this criterion. This penalized
criterion is compared to the one used during the pruning step of the
CART algorithm. It is shown that the two criteria are similar under some
specific margin assumptions. In practice, the tuning parameter of the
CART penalty has to be calibrated by hold-out. Simulation studies are
performed which confirm that the hold-out procedure mimics the form of
the proposed penalized criterion.

Keywords: Classification Tree, Variable Selection, Statistical Learning 
Theory 
1 Introduction

Since the pioneering work of Breiman et al. [6], classification trees have
become a classical tool in machine learning. In particular, the Classification and
Regression Tree (CART) algorithm is a well-established algorithm to build and
prune tree predictors. This algorithm has been successfully applied in various
fields, see for instance [1, 7, 10, 34].

1.1 Building/selecting a tree 

The process of building (or choosing) a tree classifier from a training set can be
summarized as an optimization problem, where the goal is to find the "best"
tree classifier $\hat f$ satisfying

$$\hat f = \arg\min_{f_T} \left( P_n f_T + \mathrm{pen}(n, T) \right) , \tag{1.1}$$

where $n$ is the number of observations, $P_n f_T$ is the empirical risk of tree classifier
$f_T$ based on tree $T$, and $\mathrm{pen}(n, T)$ is a penalty function based on the size of the
training set and on the characteristics of $T$.

Obtaining the best tree classifier $\hat f$ requires solving a non-convex optimization
problem over a large set of trees, which is infeasible in practice. As an alternative,
a 2-step heuristic approach to solve this problem has been proposed in [6], in
the particular case where the penalty is of the form

$$\mathrm{pen}(n, T) = \alpha_n \times |T| , \tag{1.2}$$

where $\alpha_n$ is a tuning parameter that depends on $n$, and $|T|$ is the size of the
tree, i.e. the number of leaves (terminal nodes) of $T$. In the first step (called
the growing step) a large tree $T_{\max}$ that achieves a perfect classification on the
training set is built. Then, during the second step (called the pruning step),
the optimal subtree is obtained from the large tree, where the optimal subtree
satisfies

$$\hat f_{\mathrm{prun}} = \arg\min_{f_T,\ T \subseteq T_{\max}} P_n f_T + \alpha_n \times |T| .$$

While this heuristic approach is at the heart of the CART algorithm and is
probably the most popular strategy to prune a tree, one should keep in mind
that the actual goal is in fact to solve Problem (1.1), and to obtain the properties
of $\hat f$, whatever the (approximate) strategy that is applied to find it.
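As an illustration of the pruning criterion, the following minimal sketch selects, among a list of candidate subtrees, the one minimizing the empirical risk plus $\alpha_n$ times the number of leaves. The function name and the (risk, size) pairs are hypothetical and chosen only for illustration; the sketch does not reproduce the actual CART pruning algorithm, which builds the nested subtree sequence itself.

```python
from typing import List, Tuple


def select_pruned_subtree(candidates: List[Tuple[float, int]], alpha_n: float) -> int:
    """Return the index of the candidate minimizing risk + alpha_n * size.

    Each candidate is a pair (empirical risk, number of leaves).
    """
    scores = [risk + alpha_n * size for risk, size in candidates]
    return min(range(len(candidates)), key=scores.__getitem__)


# Hypothetical (empirical risk, number of leaves) pairs for subtrees of T_max;
# the last entry plays the role of the maximal tree (zero training error).
subtrees = [(0.40, 1), (0.22, 2), (0.15, 3), (0.05, 6), (0.00, 12)]

small_alpha = select_pruned_subtree(subtrees, alpha_n=0.01)  # favors larger trees
large_alpha = select_pruned_subtree(subtrees, alpha_n=0.10)  # favors smaller trees
```

A larger value of `alpha_n` moves the selection toward smaller subtrees, which is why the calibration of this parameter matters.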

From a theoretical point of view, many works have investigated the performance
of the tree classifier resulting from the pruning step of CART rather than
from the generic optimization problem. In the Gaussian or bounded regression
context, penalty (1.2) was validated in [14] using a model selection framework.
Another validation was obtained in the classification framework in [28]. More
recently, a refined analysis of the pruning step was proposed in [12], where
margin adaptive risk bounds were obtained in the binary classification context.
Importantly, these theoretical results are actually obtained conditionally on the
construction of $T_{\max}$. This means that only the performance of the pruning
step is assessed, while the growing step is not taken into account.



1.2 Classification trees and variable selection 

Because they are based on the 2-step heuristic of the CART algorithm, results
obtained so far fail to take into account the complete process of obtaining a
tree classifier. In particular, the embedded variable selection process that is
inherent to tree classification algorithms has never been investigated. A variable
selection process is called embedded when it is included in the training step of
the classification algorithm, so that the learning and variable selection processes
cannot be separated. This embedded property is actually one of the main
arguments for the use of tree classifiers to deal with high-dimensional data (see
[5, 11, 13] for example). Note that in the CART algorithm, the inner selection
process results from the recursive growing strategy of the tree: at each node, the
"best" variable is selected among all for splitting. As a result, in many cases
the maximal tree (and consequently all of its subtrees) only includes a small
subset of the $p$ initial variables. As a consequence, as long as tree classifiers are
studied through the pruning step of the CART heuristic (hence conditionally on
the growing step), it is impossible to investigate the complete variable selection
process.

Although the embedded variable selection process is well-known ([8, 15, 20]),
it may appear at first glance that it is not correctly handled in the optimization
program

$$\hat f = \arg\min_{f_T} \left( P_n f_T + \alpha_n \times |T| \right) , \tag{1.3}$$

assuming the form of the penalty proposed in [6] is correct. Indeed, this penalized
criterion does not explicitly depend on the total number of covariates
$p$. This may seem surprising: in both the regression and classification frameworks,
theoretical studies have shown that in the variable selection context, an
extra term should be added to the penalty that is used when only one model is
considered per dimension ([2, 24]) to obtain oracle-type inequalities. Since the
collection of possible trees increases with $p$, $p$ should play a crucial role in the
regularization term.

Since parameter $p$ does not explicitly appear in criterion (1.3), one can argue
that $p$ is hidden in the constant term $\alpha_n$. This argument is supported by at least
two penalties that can be exhibited from previous works:

• In [28] (equation 4), the penalty term has the form



pen{\Tln) = x ■ ' l^'^l^S" 




• In [12] (Theorem 1), the penalty term is of order

pen{\Tln) « ^ Plog + logWlog b))) ^ 
$= \alpha(p, n)\,|T|$ ,



where $C_1$ and $C_2$ are known constants. While these two penalty functions depend
on $p$, one can observe that their scaling order is much larger than the
$\log(p)$ usually obtained in the variable selection context [2, 24].
1.3 Contribution

The goal of the present paper is to investigate the classification performance of
the tree classifier obtained by solving Problem (1.1), and to decipher the exact
impact of variable selection on tree classifier selection. While this impact is
theoretically studied through an ideal exhaustive selection procedure (infeasible in
practice), it sheds light on the heuristic procedures currently used in practice
to mimic the ideal one (see Section 3.2). From a theoretical point of view, we
consider the model selection problem where the goal is to select a candidate 
from all possible tree classifiers. The strategy consists in choosing the candi- 
date minimizing a penalized criterion that depends on parameters p and n. In 
this model selection context, we exhibit a penalization function where the vari- 
able selection process is explicitly taken into account, and provide performance 
guarantees for the candidate tree classifier through an upper bound of its risk. 
Then it is shown that the impact of variable selection, although investigated via 
the theoretical minimization problem (1.1), can also be exhibited in practice for 
practical heuristic approaches. More precisely, a simulation study is performed 
which shows that the proposed theoretical penalization function is actually the 
one that is implicitly used in the pruning step of the CART algorithm. 

The paper is organized as follows. Section 2 presents the framework of binary 
classification and describes tree classifiers. The main theoretical contribution 
and the simulation study are presented in Section 3. Some discussion is devel- 
oped in Section 4, and finally Section 5 gives the proofs of the results presented 
in Section 3. 

2 Context 

2.1 Classification framework 

The classification framework considered is the following. Suppose one observes a
sample $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ of $n$ independent copies of the random variable
$(X, Y)$, where the explanatory variable $X$ takes values in a measurable space
$\mathcal{X}$ of dimension $p \geq 2$, and is associated with a label $Y$ taking values in $\{0, 1\}$.
Suppose moreover that each coordinate of $X$ is ordered (i.e. $\mathcal{X}$ is a product of $p$
ordered subspaces). A classifier is then any function $f$ mapping $\mathcal{X}$ into $\{0, 1\}$.
The quality of a classifier is measured by its misclassification rate

$$P f := P(f(X) \neq Y) , \tag{2.1}$$

where $P$ denotes the joint distribution of $(X, Y)$. If the joint distribution of
$(X, Y)$ were known, the problem of finding an optimal classifier minimizing the
misclassification rate would be easily solved by considering the Bayes classifier
$f^*$ defined for every $x \in \mathcal{X}$ by

$$f^*(x) = 1_{\{\eta(x) \geq 1/2\}} , \tag{2.2}$$

where $\eta(x)$ is the conditional expectation of $Y$ given $X = x$, that is

$$\eta(x) := P[Y = 1 \mid X = x] . \tag{2.3}$$

As $P$ is unknown, the goal is to construct from sample $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$
a classifier $\hat f$ that is as close as possible to $f^*$ in the following sense: since $f^*$
minimizes the misclassification rate, $\hat f$ will be chosen in such a way that its
misclassification rate is as close as possible to the misclassification rate of $f^*$,
i.e. in such a way that the loss

$$l(f^*, \hat f) = P(\hat f(X) \neq Y) - P(f^*(X) \neq Y) \tag{2.4}$$

is as small as possible.

Many strategies or classification algorithms have been proposed to build $\hat f$ (see
[16], [3] for an overview). The quality of a strategy is measured by its risk

$$E\left[ l(f^*, \hat f) \right] ,$$

where the expectation is taken with respect to the sample distribution. In the
model selection framework, two strategies are usually considered:

• Empirical Risk Minimization: $\hat f$ is chosen as the minimizer of

$$P_n f = \frac{1}{n} \sum_{i=1}^{n} 1_{\{f(X_i) \neq Y_i\}} , \tag{2.5}$$

over all classifiers $f$ belonging to a single class of classifiers,

• Structural Risk Minimization: $\hat f$ is chosen as the minimizer of the penalized
empirical risk over a collection of classes.
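The empirical risk in (2.5) is simply the fraction of misclassified training points. A minimal sketch is given below; the classifier `stump` and the toy sample are hypothetical and serve only to make the computation concrete.

```python
from typing import Callable, Sequence, Tuple


def empirical_risk(
    f: Callable[[Tuple[float, ...]], int],
    xs: Sequence[Tuple[float, ...]],
    ys: Sequence[int],
) -> float:
    """Empirical misclassification rate: (1/n) * #{i : f(X_i) != Y_i}."""
    n = len(xs)
    return sum(1 for x, y in zip(xs, ys) if f(x) != y) / n


def stump(x: Tuple[float, ...]) -> int:
    """Hypothetical one-split classifier: label 1 if the first coordinate exceeds 0.5."""
    return 1 if x[0] > 0.5 else 0


# Toy sample with one explanatory variable (illustrative values only).
xs = [(0.1,), (0.4,), (0.6,), (0.9,)]
ys = [0, 1, 1, 1]

risk = empirical_risk(stump, xs, ys)  # one error out of four points
```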

2.2 Margin assumptions 

It is now well known that without any assumption on the joint distribution
$P$, when considering a class of classifiers with finite Vapnik-Chervonenkis (VC)
dimension, the minimax convergence rate of the risk bound is of order $O(1/\sqrt{n})$.
It has also been shown that, under the overoptimistic zero-error assumption
(that is $Y = \eta(X)$ almost surely, where $\eta$ is defined by (2.3)), this minimax
convergence rate is at best of order $O(1/n)$ (see [33, 22] for example).

These two extreme cases can be modulated by so-called margin assumptions
that make the link between the "global" pessimistic case (without any assumption
on $P$) and the zero-error case ([18, 19, 23, 27, 26, 31, 32]).

In this paper, we consider the margin assumption proposed in [23]:
MA(1) There exist some constants $C_0 > 0$ and $\kappa > 1$ such that, for all $t > 0$,

$$P\left( |2\eta(X) - 1| \leq t \right) \leq C_0\, t^{\frac{1}{\kappa - 1}} . \tag{2.6}$$

Note that by taking $t = h \in ]0, 1[$ and the limit value $\kappa = 1$, we obtain the
stronger assumption proposed in [27] (see also the slightly weaker condition
proposed in [17]):

MA(2) There exists $h \in ]0, 1[$ such that

$$P\left( |2\eta(X) - 1| \leq h \right) = 0 . \tag{2.7}$$

Assumption MA(2) has an intuitive interpretation. It means that $(X, Y)$ is
sufficiently well distributed to ensure that there is no region in $\mathcal{X}$ for which the
toss-up strategy could be favored over others: $h$ can be viewed as a measurement
of the gap between labels 0 and 1 in the sense that, if $\eta(x)$ is too close to 1/2,
then choosing 0 or 1 will not make a real difference for that $x$. From a general
point of view, the margin parameter quantifies the noise level of the classification
problem, and may be understood as the equivalent of the variance parameter in
the Gaussian model selection setting.
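To make assumption MA(2) concrete, the sketch below checks it in the simple situation where $\eta(X)$ only takes finitely many values, each with positive probability. In that case MA(2) holds for a given $h$ exactly when every value satisfies $|2\eta - 1| > h$. The helper names are hypothetical.

```python
from typing import Iterable


def margin(eta: float) -> float:
    """Distance |2*eta - 1| between the regression function and the toss-up value."""
    return abs(2.0 * eta - 1.0)


def satisfies_ma2(eta_values: Iterable[float], h: float) -> bool:
    """Check P(|2*eta(X) - 1| <= h) = 0 when eta(X) takes finitely many values,
    each assumed to have positive probability."""
    return all(margin(eta) > h for eta in eta_values)


# eta takes values 0.3 and 0.7: the margin is 0.4 everywhere, so MA(2) holds for h = 0.3.
well_separated = satisfies_ma2([0.3, 0.7], h=0.3)
# If eta equals 1/2 on a region of positive probability, MA(2) fails for every h.
toss_up_region = satisfies_ma2([0.3, 0.5], h=0.3)
```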

2.3 Tree classifiers, classes of tree classifiers 

A tree $T$ is a structure that can be represented as a hierarchy whose elements
are called nodes. For binary trees, each node has either 0 or 2 children (called
Left and Right). The initial node is called the root of the tree and a node
with no child is called a leaf. The size of tree $T$ is defined as the number of its
leaves and denoted $|T|$ in the following. In this paper, we define a tree $T_{c\ell}$ by two
elements:

• its configuration $c$, i.e. the hierarchy between the nodes: for instance, in
Figure 1, we know that node 6 is the Left child node of node 3, and so on,

• the ordered list $\ell$ of variables that appear at each node, i.e. the $k$-th variable
in the list appears in node $k$.




Figure 1: Tree configuration example: for each node, the parent and child nodes 
are known. 

A tree classifier $f$ based on tree $T_{c\ell}$ associates

• at each internal node a condition of the form "$X^{j_k} > s^k$", where $j_k$ is the
index of the variable associated with node $k$ and $s^k$ is a threshold,

• at each terminal node a label (here 0 or 1).

Therefore, an observation $x \in \mathcal{X}$ will be classified as follows: starting at the
root, observation $x$ will move from a node of $f$ to another using the following
rule: at node $k$, if $x^{j_k} > s^k$ then $x$ moves to Right, otherwise it moves to Left.
At the end of the process, $x$ will be classified according to the label of the leaf
it reaches.

Figure 2: Two tree classifiers that belong to the same class.
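The classification rule above can be sketched as follows. The data structures `Leaf` and `Node` and the particular trees are hypothetical illustrations: the two trees share the same configuration and the same variable list, and only differ by their thresholds and labels, so they belong to the same class. Variable indices are 0-based in the code.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class Leaf:
    label: int  # label assigned to the region, here 0 or 1


@dataclass
class Node:
    variable: int           # index j_k of the variable tested at this node (0-based)
    threshold: float        # threshold s^k
    left: "Union[Leaf, Node]"
    right: "Union[Leaf, Node]"


def classify(tree: Union[Leaf, Node], x: Sequence[float]) -> int:
    """Move to Right if x[j_k] > s^k, to Left otherwise, until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.right if x[tree.variable] > tree.threshold else tree.left
    return tree.label


# Same configuration and variable list (third variable at the root, first variable
# at its Right child), different thresholds and labels: illustrative values only.
tree_a = Node(2, 0.34, Leaf(0), Node(0, 0.5, Leaf(0), Leaf(1)))
tree_b = Node(2, 0.58, Leaf(1), Node(0, 0.1, Leaf(0), Leaf(1)))

x = (0.3, 0.0, 0.4)
label_a = classify(tree_a, x)  # 0.4 > 0.34 -> Right, then 0.3 <= 0.5 -> Left leaf
label_b = classify(tree_b, x)  # 0.4 <= 0.58 -> Left leaf
```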

To summarize, a tree classifier associated with tree $T_{c\ell}$ splits $\mathcal{X}$ into $|T_{c\ell}|$ regions,
each associated with a label, and two classifiers associated with the same tree
$T_{c\ell}$ differ in that the thresholds (for internal nodes) and labels (for leaves) are
not the same. An example of two such tree classifiers is given in Figure 2. In
the following, we will consider classes $C_{c\ell} = \{f : f \text{ based on } T_{c\ell}\}$ of classifiers
based on the same tree $T_{c\ell}$.
Finally, we define

$$\bar f_{c\ell} \in \arg\min_{f \in C_{c\ell}} P f , \tag{2.8}$$

where $P f$ is defined by (2.1).



3 Results 

3.1 Risk bounds 

We first consider a single class $C_{c\ell}$ of tree classifiers and its associated empirical
risk minimizer

$$\hat f_{c\ell} \in \arg\min_{f \in C_{c\ell}} P_n f ,$$

where $P_n f$ is defined by (2.5).

Proposition 1. Assume that margin assumption MA(1) is verified. For all
$t_{c\ell} > 0$ and $\alpha > 0$, there exist positive constants $K_1$, $K_2$, $K$ depending on $\alpha$,
$C_0$ and $\kappa$ such that, with probability at least $1 - e^{-t_{c\ell}}$,

Kf.u < (1 +»)<(/•,/.) + feyMM) ^ + (k) + 

\ n J V " / 

Moreover, we obtain the following upper bound 

E [l{f\U] ^ (1 + a)l{f*J,,) + K, ( + Cn-^ .(3.2) 

The obtained bound is consistent with classical results already given in [23].
In particular, if the Bayes classifier belongs to class $C_{c\ell}$, the rate of convergence
for the risk associated with estimator $\hat f_{c\ell}$ is of order $(\log(2n)/n)^{\frac{\kappa}{2\kappa - 1}}$.

In practice, since no information is available about how to choose class $C_{c\ell}$,
one needs to consider the collection $\mathcal{M}$ of all possible configurations and variable
lists. In each class $C_{c\ell}$, a candidate is chosen by empirical risk minimization,
then the final classifier $\hat f$ is selected among all class candidates by minimization
of a penalized criterion:

$$\hat f = \arg\min_{\hat f_{c\ell}} \left( P_n \hat f_{c\ell} + \mathrm{pen}(c, \ell) \right) .$$

The following result provides insight about how the penalty should be chosen
to ensure good performance for $\hat f$.

Proposition 2. Assume that margin assumption MA(1) is verified. If 

$$\hat f = \arg\min_{\hat f_{c\ell}} \left( P_n \hat f_{c\ell} + \mathrm{pen}(c, \ell) \right) , \tag{3.3}$$



whe 



with constants C'^ and C" depending on Co and k appearing in the margin condi- 
tion, then there exist positive constants C[, C2 and E such that with probability 
at least 1 — 3Se~^ 



+ - 
n 



l{f*,f) C^inf ( inf l{f*,f)+pen{c,e)\+C^ ( (-) 
cj l^feCai J \\n/ 

Moreover, we obtain the following upper bound: 

mirJ)] ^ C[ inf^ I inf l{f\f) +pen{cj)\ + (3.5) 
The proofs of Propositions 1 and 2 are given in Section 5. 



Several comments can be made about the result of Proposition 2: 



Quality of the upper bound Compared with previous results [28, 12], the
upper bound for the risk is improved in two different ways. First, since all possible
binary trees are considered, in the present result the complete construction
path of the tree classifier is taken into account: the infimum in equation (3.5)
is taken over all possible classes of tree classifiers. Conversely, in previous results
only the performance of the pruning step was assessed, i.e. the corresponding
infimum was restricted to the list of classes associated with subtrees of the maximal
tree. Second, thanks to the margin hypothesis, the convergence rate of the
upper bound is faster than $O(1/\sqrt{n})$ as soon as $\kappa < +\infty$.

Margin parameter The proposed penalty (3.4) depends on the margin parameter
$\kappa$, which is usually unknown in practice. From a theoretical point of
view, because this parameter quantifies the noise level of the classification problem,
it necessarily appears in the ideal penalty function (as does the unknown
variance in Gaussian model selection). From a practical point of view, it has to
be estimated from the data. Obtaining this estimate in the general case is an
open question.

Strong margin assumption In the particular case of margin assumption
MA(2) given by equation (2.7), penalty (3.4) becomes (taking $\kappa = 1$):

$$\mathrm{pen}(c, \ell) = \frac{C'_1 \log(2n) + C'_2 \log(p)}{n}\, |T_{c\ell}| = \alpha_n |T_{c\ell}| .$$

This corresponds exactly to the penalty proposed in [6] for the CART algorithm
(see equation (1.2)). This penalty function has already been validated for the
pruning step of the CART algorithm (see [14] for the regression framework and
[12] for the binary classification framework). A similar result is established by
Proposition 2 when considering the exact optimization problem (1.1). Also note
that in this context the margin parameter only appears in constant $\alpha_n$. Because
this constant will be tuned according to the data (using cross-validation for
instance), the problem of estimating the margin parameter is avoided.

Variable selection In comparison with the upper bound obtained in Proposition 1,
one can observe in (3.5) the impact of parameter $p$ that appears through
the penalty. This quantity arises during the union bound step of the proof (see
Section 5.3), where one has to count the number of classes sharing the same
complexity. This conveys the fact that to build an optimal tree of size $k$, one
has to choose one variable for each of its $k - 1$ internal nodes among $p$ (with
replacement). This is obviously a much easier task when $p = 100$ than when
$p = 10{,}000$. This is where the variable selection task is taken into account.
Moreover, the penalty term can be upper bounded by
advocating for a penalty that should be linear with respect to $\log(p)$. This linear
relationship is investigated in Section 3.2.

Oracle-type inequality Vapnik-Chervonenkis bounds for binary classifica- 
tion without any margin assumption give the following penalty form (see [9] for 
instance) 

. ,^ ^1 l \Tci\\ogin) , ^2 \Tct\ 

penv (c,e) = Cy\ 1- 6y . 

V n n 

This implies that, for classes associated with trees of large size, $\mathrm{pen}(c, \ell)$ given
in (3.4) becomes larger than $\mathrm{pen}_V(c, \ell)$. Therefore, to obtain an oracle-type
inequality, $\mathrm{pen}(c, \ell)$ can be replaced by $\min\{\mathrm{pen}_V(c, \ell), \mathrm{pen}(c, \ell)\}$.

3.2 Illustration on simulated data 
3.2.1 Practical determination of / 

The application of the strategy described in Proposition 2 requires finding
the empirical risk minimizer in each class $C_{c\ell}$, and then comparing all the
candidates $\hat f_{c\ell}$ using the penalized criterion given by (3.3). From a computational
point of view, the exhaustive comparison among all classes is an NP-hard
problem. Therefore we need heuristic algorithms to obtain a sequence of
near-optimal empirical risk minimizers $(\hat f_k)_k$ such that

$$\hat f_k \approx \arg\min_{\{\hat f_{c\ell} :\ |T_{c\ell}| = k\}} P_n \hat f_{c\ell} .$$

The CART algorithm, when applied with the empirical risk as an impurity
measure at each node (see [16]), may be understood as a forward heuristic algorithm
to build the sequence of optimal tree classifiers. In particular, the subtree
classifier $\hat f_k$ of size $k$ extracted from the maximal tree can be interpreted as the
(approximate) optimizer of the empirical risk over all the possible trees of size $k$.

This new understanding of the CART algorithm as a heuristic approach to
obtain the sequence of subtree minimizers is important, because it points out
that these subtree classifiers $\hat f_k$ should be penalized as if the exhaustive search
were performed, i.e. using the penalty given by (3.4).

In most applications, when dealing with the construction of a tree classifier,
experimenters use criterion (1.2) in a growing-pruning strategy, and the
unknown parameter $\alpha_n$ is chosen by hold-out or Q-fold cross-validation. This
estimated value can be compared with its theoretical counterpart given in (3.4).
To this end, we perform a simulation study and compare the $\alpha_n$ obtained by
cross-validation to its theoretical form

$$\frac{C'_1 \log(2n) + C'_2 \log(p)}{n}$$

obtained under the strong margin assumption MA(2).
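The theoretical form of the tuning parameter under MA(2) can be evaluated directly. The sketch below is only illustrative: the constants are unknown in practice, so the default values `c1 = c2 = 1.0` are arbitrary placeholders. It makes explicit the two qualitative features that the simulation study looks for, namely a linear dependence on $\log(p)$ and a decrease with respect to $n$.

```python
import math


def theoretical_alpha(n: int, p: int, c1: float = 1.0, c2: float = 1.0) -> float:
    """Theoretical form (c1 * log(2n) + c2 * log(p)) / n of the tuning parameter.

    c1 and c2 stand for the unknown constants; their default values are arbitrary.
    """
    return (c1 * math.log(2 * n) + c2 * math.log(p)) / n


# Linear in log(p): squaring p (doubling log p) adds c2 * log(p) / n.
increment = theoretical_alpha(100, 100 ** 2) - theoretical_alpha(100, 100)

# Decreasing in n for fixed p.
alpha_small_n = theoretical_alpha(50, 1000)
alpha_large_n = theoretical_alpha(200, 1000)
```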

3.2.2 Simulations 

We consider four simulation designs: 

Design 1 Variables $X^1, \ldots, X^p$ are independently generated with distribution
$\mathcal{N}(0, 1)$. The label is generated as follows: if $X^1 > 0$ and $X^2 > 0$ then $Y = 1$
with probability $q$, otherwise $Y = 1$ with probability $1 - q$. Therefore only
variables $X^1$ and $X^2$ are informative. In this design, the Bayes classifier can be
represented as a tree with 3 leaves, hence it belongs to the considered collection
of classes. Moreover, variables are independent, and margin assumption MA(2)
is satisfied.

Design 2 First the labels are generated according to a Bernoulli distribution
with parameter 1/2. Then variable $X^1$ is generated such that $X^1 \mid Y = 0$ and
$X^1 \mid Y = 1$ are normally distributed with means 0 and 1, respectively, and
variance $\sigma^2$. Variables $X^2, \ldots, X^p$ are independent with distribution $\mathcal{N}(0, 1)$ and
are non-informative. As for design 1, the Bayes classifier can be represented
as a tree and variables are independent, but it is easy to show that margin
assumption MA(2) is not satisfied.

Design 3 Labels are simulated as in design 2. Then variables $X^1$ and $X^2$
are generated such that, for $j = 1, 2$, $X^j \mid Y = 0$ and $X^j \mid Y = 1$ are normally
distributed with means 0 and 1, respectively, and variance $\sigma^2$. The last $p - 2$
variables are independent and non-informative. Here the Bayes classifier no
longer belongs to the collection of tree classes, and margin assumption MA(2)
is not satisfied.

Design 4 Three independent variables $X^1, X^2, X^3$ are generated with distribution
$\mathcal{N}(0, 1)$. Each additional variable $X^j$ is then simulated as a noisy copy of
$(X^1 + X^2 + X^3)/\sqrt{3}$. The label is generated as follows: if $(X^1)^2 + (X^2)^2 + (X^3)^2 >
2.5$ then $Y = 1$, else $Y = 0$. Here, all the variables are correlated (with a strong
correlation between the extra variables), the Bayes classifier cannot be represented
as a tree, and margin assumption MA(2) is not satisfied.

For designs 1 to 3, 400 samples are generated, and 1000 for design 4. On each
of them, a tree classifier is selected using the growing/pruning strategy, where
parameter $\alpha_n$ is selected by 10-fold cross-validation. Different values of parameters
$n$ ($n = 50, 100, 200$) and noise ($q = 0.1, 0.2, 0.3$ in design 1, $\sigma^2 = 0.5, 1, 2$
in designs 2 and 3, and a noise level of 0.2 in design 4) are used. The number of variables
considered to build the classifiers grows from p = 30 to p= 10^. 
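As a rough illustration of how such data can be generated, the sketch below simulates one sample from Design 1. It assumes the following reading of the design: the label equals 1 with probability $q$ when both of the first two variables are positive, and with probability $1 - q$ otherwise. The function name is hypothetical, and only the Python standard library is used; the growing/pruning and cross-validation steps of the study are not reproduced here.

```python
import random
from typing import List, Tuple


def simulate_design1(
    n: int, p: int, q: float, rng: random.Random
) -> Tuple[List[List[float]], List[int]]:
    """Draw n observations with p independent N(0, 1) variables.

    Assumed labelling rule: P(Y = 1) = q if x[0] > 0 and x[1] > 0,
    and P(Y = 1) = 1 - q otherwise (only the first two variables matter).
    """
    xs, ys = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        prob_one = q if (x[0] > 0 and x[1] > 0) else 1.0 - q
        ys.append(1 if rng.random() < prob_one else 0)
        xs.append(x)
    return xs, ys


# One sample with n = 50 observations, p = 30 variables and noise level q = 0.3.
xs, ys = simulate_design1(n=50, p=30, q=0.3, rng=random.Random(0))

# With q = 0 the labels are a deterministic function of the first two variables.
xs0, ys0 = simulate_design1(n=200, p=5, q=0.0, rng=random.Random(1))
```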
Figure 3 displays the average value (over 400 simulations) of $\alpha_n$ versus the
logarithm of the number of variables for the different designs. Parameter $\alpha_n$ decreases with
respect to $n$, and the relationship between the selected $\alpha_n$ and $\log p$ is linear.
These behaviors are observed whatever the level of noise (not shown) and whatever
the design. This confirms that variable selection is taken into account by
the pruning procedure of the CART algorithm through the choice of $\alpha_n$. This
also suggests that the penalty function proposed in (3.4) is relevant regarding
its dependency on $\log p$.




Figure 3: Average value of $\alpha_n$ with respect to $\log p$, for $n = 50$ (+), $n = 100$
(x) and $n = 200$ (*). Data are simulated from design 1 with $q = 0.3$ (Top
Left), design 2 (Top Right), design 3 (Bottom Left) with $\sigma^2 = 2$. For design 4
(Bottom Right) the average $\alpha_n$ is obtained over 1000 samples, for $n = 100$.



4 Discussion 

As stated in the Introduction, most previous results are related to the pruning
step of the CART algorithm rather than considering the general optimization
problem (1.1). For instance, in [12] and [28], risk bounds are obtained for
the collection of CART pruned subtrees, which itself depends on the data at
hand: the collection of models includes classes $C_0, \ldots, C_{K-1}, C_K$ of tree classifiers
built on the maximal tree $T_K$, obtained from the training set, and its subtrees
$T_0 \preceq \ldots \preceq T_{K-1}$. Thus the conditional risk bounds provided in previous
articles only guarantee that the risk of the candidate is at most of the order of
the risk of the class corresponding to the best subtree of this collection. While this exactly
describes the process of the CART algorithm, the guarantee may be poor if the
best subtree of the collection is far from the best tree among all possible trees.
Conversely, the approach presented here guarantees that the risk bound for the
selected tree classifier is comparable to the risk of the class corresponding to the
optimal tree (among all possible trees).

Proposition 2 generalizes the results obtained in [30] in two ways. First,
Scott and Nowak considered the particular case where the tree classifiers are
constructed on a fixed dyadic grid. In dyadic trees, the choice of the threshold
at each internal node is deterministic, instead of being optimally tuned on the
training set. This optimization is taken into account in the results presented
here. Second, as recalled in Section 2, without any margin assumption, the
penalty functions obtained in [30] are naturally proportional to the square root
of the tree size over $n$. A $\sqrt{\log p}$ factor also appears in the resulting penalties.
In comparison, the results presented here exhibit a range of penalty functions
from square root to linear depending on the margin assumption. If MA(2)
is satisfied, this validates the form of the penalty implemented in the CART
algorithm. If MA(1) is satisfied, it leads to better convergence rates for the
risk bound.

Whenever margin assumption MA(1) is satisfied, the penalty suggested in
Proposition 2 is sublinear. In this case the heuristic approach of the CART
algorithm can still be employed to obtain an approximate version of $\hat f$. Indeed,
as proved in [29], pruning with subadditive penalties produces sequences
of pruned subtrees included in the sequence obtained through pruning with a
linear penalty. This means that one can obtain an approximate optimizer of
criterion (3.3), provided that the margin parameter is known.

The theoretical form of the penalty term (3.4) derived in Proposition 2 is of 
practical interest. First, it shows that sequential selection algorithms, such as 
stepwise or backward variable selection methods, can be easily studied in the 
model selection framework where the selection is supposed to be exhaustive. In 
the particular case of tree classification, the simulation study confirms that the 
penalty derived under the hypothesis of exhaustive variable selection is the one 
that is used in practice by the CART algorithm, that proceeds as a forward 
variable selection process. Second, it provides an interesting insight into the 
CART variable selection process. Indeed, the definition of the classes comes
from the fact that a single variable may appear at different nodes, a specificity
that changes the classical way of taking into account variable selection in the
penalty term: in trees the variable list is ordered (the first variable of the list
is associated with the first node) and a variable may be associated with several
nodes. Therefore the classical $\binom{p}{k-1}$ term that appears in penalties in [2] or [24]
(i.e. the number of unordered samples drawn without replacement) is
replaced with $p^{k-1}$ (i.e. the number of ordered samples drawn with
replacement).

In [18], Koltchinskii provides a synthesis of oracle inequalities in classification.
In particular, the author considers margin assumptions more general than
the margin assumption MA(1) given in [23]. The in-probability upper bounds
for the loss $l(f^*, \hat f)$ given in Propositions 1 and 2 can be straightforwardly
generalized using Koltchinskii's margin definition. This would lead to improved
in-probability upper bounds for the loss $l(f^*, \hat f)$, similar to the one given in
Theorem 6 of [18]. However, unlike hypothesis MA(1) considered here, it would
not permit one to obtain explicit rates of convergence for the risk. Importantly,
using a more general margin assumption would provide no improvement concerning
the embedded selection aspect that we investigated here. From this
aspect the results obtained are tight, as illustrated by the simulation study.
5 Proofs

5.1 Preliminary results 

We provide two lemmas regarding the Vapnik entropy and the cardinality of
tree class collections.

Denote by $H_{c\ell}$ the Vapnik-Chervonenkis log-entropy of class $C_{c\ell}$:

$$H_{c\ell} = \log \left| \left\{ A(f) \cap \{X_1, \ldots, X_n\},\ f \in C_{c\ell} \right\} \right| ,$$

where $A(f) = \{x \in \mathcal{X} : f(x) = 1\}$.

Lemma 1. For a tree class $C_{c\ell}$, one has

$$E(H_{c\ell}) \leq |T_{c\ell}| \log(2n) .$$

This is obtained from Lemma 2 in [14]. For a tree with $|T_{c\ell}|$ leaves, there are
$|T_{c\ell}| - 1$ nodes for which the thresholds have to be estimated, each leading to at
most $n$ ways to split the training sample. The possible number of splittings is
bounded by $n^{|T_{c\ell}| - 1}$. A given splitting divides the sample into $|T_{c\ell}|$ subsamples,
and each of these subsamples receives label 0 or 1. There are $2^{|T_{c\ell}|}$ ways to label
the subsamples, hence

$$H_{c\ell} \leq \log\left( n^{|T_{c\ell}| - 1} \times 2^{|T_{c\ell}|} \right) \leq |T_{c\ell}| \log(2n) .$$

Taking the expectation leads to the result.






Lemma 2. The number of classes of trees of size $k$ is

$$p^{k-1} N(k), \quad \text{with} \quad N(k) = \frac{1}{k} \binom{2k-2}{k-1} .$$

First note that counting the number of classes amounts to counting the number
of trees. A tree is defined by a configuration $c$ combined with a variable
list $\ell$. The total number of tree configurations of size $k$ is given by the Catalan
number $N(k)$. The total number of lists of $k - 1$ variables is $p^{k-1}$, because at
each node we have to choose between the $p$ available variables. Combined with
the total number of tree configurations, this leads to the proposed lemma.
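The count in Lemma 2 is easy to evaluate numerically. The sketch below uses the standard closed form of the Catalan number for the number of binary tree configurations with $k$ leaves; the function names are hypothetical. It also shows that the logarithm of the number of classes contains the term $(k - 1)\log(p)$, which is how $p$ enters the penalty through the union bound.

```python
import math


def n_configurations(k: int) -> int:
    """Number of binary tree configurations with k leaves (Catalan number)."""
    return math.comb(2 * k - 2, k - 1) // k


def n_tree_classes(p: int, k: int) -> int:
    """Number of classes of trees of size k: p**(k-1) variable lists per configuration."""
    return p ** (k - 1) * n_configurations(k)


# First values of N(k) for k = 1, ..., 5.
catalan_values = [n_configurations(k) for k in range(1, 6)]

# With p = 10 variables and k = 3 leaves: 10**2 variable lists times 2 configurations.
classes_p10_k3 = n_tree_classes(10, 3)

# The log-count splits into a log(p) part and a configuration part.
log_count = math.log(n_tree_classes(100, 4))
log_split = 3 * math.log(100) + math.log(n_configurations(4))
```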



Remark In contrast with the classical variable selection framework, in trees
the variable list is ordered (the first variable of the list is associated with the
first node) and a variable may be associated with several nodes. Therefore the
classical $\binom{p}{k-1}$ term that appears in penalties in [2] or [24] (i.e. the number of
unordered samples drawn without replacement) is replaced with $p^{k-1}$
(i.e. the number of ordered samples drawn with replacement).



5.2 Proof of Proposition 1 

A classical way to bound $l(f^*, \hat f_{c\ell})$ is to use the following decomposition:

$$l(f^*, \hat f_{c\ell}) = l(f^*, \bar f_{c\ell}) + P \hat f_{c\ell} - P \bar f_{c\ell} ,$$

and then to upper bound the variance term $P \hat f_{c\ell} - P \bar f_{c\ell}$. In the case where class
$C_{c\ell}$ is finite, an upper bound can be obtained by using Bernstein's inequality, as
developed in [21] for instance. In our setting, because there may be (at least)
one continuous coordinate (i.e. one continuous variable), classes $C_{c\ell}$ are not
finite. In this case, the upper bound can be obtained using Theorem 2 from [18],
which can be restated for our purpose as follows:

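Theorem 5.2.1 (Koltchinskii, 2006). If there exists a nondecreasing strictly
concave function $\psi_{c\ell} : \mathbb{R}_+ \to \mathbb{R}_+$ such that with probability at least $1 - e^{-t_{c\ell}}$

$$\sup_{f,g \in \mathcal{C}_{c\ell}(\delta)} |(P_n - P)(f - g)| \le \psi_{c\ell}(\delta) ,$$

and if $\psi^{\sharp}_{c\ell}$ is defined as

$$\psi^{\sharp}_{c\ell}(\varepsilon) = \inf\left\{\delta > 0 \ \text{s.t.}\ \sup_{\sigma \ge \delta} \frac{\psi_{c\ell}(\sigma)}{\sigma} \le \varepsilon\right\} ,$$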



then for all $\delta \ge \psi^{\sharp}_{c\ell}(1/q)$

$$\mathbb{P}\left(P\hat f_{c\ell} - P\bar f_{c\ell} > \delta\right) \le e^{-t_{c\ell}} .$$
In order to use Theorem 5.2.1, we need to provide an explicit expression for
$\psi_{c\ell}$. To proceed, we start from the following probabilistic upper bound given in
[18] and derived from Talagrand's inequality for bounded processes (see [4] for
more details):



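$$\sup_{f,g \in \mathcal{C}_{c\ell}(\delta)} |(P_n - P)(f - g)| \le 2\left( \mathbb{E}\left[\sup_{f,g \in \mathcal{C}_{c\ell}(\delta)} |(P_n - P)(f - g)|\right] + D(\mathcal{C}_{c\ell}(\delta)) \sqrt{\frac{t_{c\ell}}{n}} + \frac{t_{c\ell}}{n} \right) \tag{5.1}$$

with probability larger than $1 - e^{-t_{c\ell}}$, where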

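$$\mathcal{C}_{c\ell}(\delta) = \{f \in \mathcal{C}_{c\ell} \ \text{s.t.}\ Pf - P\bar f_{c\ell} \le \delta\}$$

and

$$D(\mathcal{C}_{c\ell}(\delta)) = \sup_{f,g \in \mathcal{C}_{c\ell}(\delta)} \sqrt{P(f-g)^2} = \sup_{f,g \in \mathcal{C}_{c\ell}(\delta)} d(f,g) .$$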

This last term can be upper-bounded in expression (5.1) using the margin
assumption MA(1) described by (2.6):



where Ck. = {k — 1)^C"' ^ ^ 



Hence 



K-1 

d{f,g) ^ 2^,{l{rJ,,)^^ +5^^ 



D 



(5.2) 



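Now because

$$\mathbb{E}\left[\sup_{f,g \in \mathcal{C}_{c\ell}(\delta)} |(P_n - P)(f - g)|\right] \le \mathbb{E}\left[\sup_{d(f,g) \le D} |(P_n - P)(f - g)|\right] ,$$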



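we can use the result of [25] (p. 295) to obtain

$$\mathbb{E}\left[\sup_{f,g \in \mathcal{C}_{c\ell}(\delta)} |(P_n - P)(f - g)|\right] \le 24 D \sqrt{\frac{\mathbb{E}[H_{c\ell}]}{n}} \tag{5.3}$$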



where $H_{c\ell}$ is the Vapnik-Chervonenkis log-entropy of $\mathcal{C}_{c\ell}$. Combining (5.2) and
(5.3), then using lemma 5 of [32], we obtain for all $\alpha \in ]0, 1[$



sup |(P„-P)(/-5)| 2 



2Va 24 



Krjc 



4Va 24 



+2— +«/(/*,/,,) +/3«,„ 
n 



E[H, 



fE[H,e] 



5^^ 



24 
In the present framework, we then have

'E[H,e] It 



n 



f E[HA 
\ n 



where 

MS) 
MS) 

and K 



V 11 



ct 



24 V n 



Moreover, -(/'"^(e) < il\{eli) + V'f (e/3) + ^, and V? and V2 

can be determined 

using the following characterization (available for all strictly concave functions 

M 

Solving this last equation for the particular form of functions $\psi_1$ and $\psi_2$, we
obtain



e\/n 



ct 



~2A V n~ 



Taking $\varepsilon = 1/q$, one has with probability larger than $1 - e^{-t_{c\ell}}$



\ n 



+6g h 3qal{f*J^i) 

n 



Using Lemma 1 and rescaling $\alpha$ properly, this leads to



24 



i^,- .(5.4) 



Renaming Ka,K,q — E^a,K,q — ^2 and Kg = K leads to the first expression 
in Proposition 1. The risk bound follows by integration. 
5.3 Proof of Proposition 2

We first choose the weights $t_{c\ell} = x_{c\ell} + x$ associated with classes $\mathcal{C}_{c\ell}$ such that
$x_{c\ell}$ and $x$ are positive and

$$\sum_{c,\ell} e^{-x_{c\ell}} = \Sigma < +\infty .$$

The exact form of the weights will be chosen later. Furthermore, we will use 
lemma 4 of [18], reformulated here for our purpose: 

Lemma 5.3.1 (Koltchinskii, 2006). Consider a class $\mathcal{C}_{c\ell}$ and assume that
MA(1) is satisfied. For all $t_{c\ell} > 0$ and $\alpha \in ]0, 2/5[$, with probability at least
$1 - 2e^{-t_{c\ell}}$, one has

PrJd " Pnf* < (1 + a)iP fci - Pfl + ( — ) + — (5.5) 

\ n J n 

and 

P7. - Pr < (l - 5o) (P,J. - P„/- + ^A, (E^) + -SK, (^) + 3A-^' 

with the same notations as above. 

We start the proof from the result obtained in Proposition 1. Combining
equation (3.1) of Proposition 1 and a classical union bound argument, one has
with probability larger than $1 - \Sigma e^{-x}$

. . ~^ , ^ . . - . /I'T'-J log(2n)\ ^ /x', + x\^ X'. + x 

\n/ \ n J n 

where $\alpha \in ]0, 2/5[$. We now use equation (5.6) from Lemma 5.3.1 to obtain with
probability larger than $1 - 3\Sigma e^{-x}$



{1 + a) ^ „ , 5K, ^|T5|log(2n)V''- , , fx^ + x 
1 



2 \ 

(1 + a) / ^ .3 , 5i^i /|T5|log(2n)\^ , ^ /^.TA^ , ...^2 



1 

2 



n 



(1 + a) / /a;\27r^ ^^^x 
1 - ^ V n 

In the context of variable selection, one has to choose the weights such that

$$\sum_{c,\ell} e^{-x_{c\ell}} < +\infty \iff \sum_{k} \ \sum_{\mathcal{C}_{c\ell} \ \text{s.t.}\ |T_{c\ell}| = k} e^{-x_{c\ell}} < +\infty .$$
Giving equal weights $x_k$ to classes of same complexity $k$ (i.e. classes $\mathcal{C}_{c\ell}$ and
$\mathcal{C}_{c'\ell'}$ such that $|T_{c\ell}| = |T_{c'\ell'}| = k$), one obtains from Lemma 2:



fc Ccf s.t. \Tae\=k k ^ 



k 



The choice $x_{c\ell} = x_{|T_{c\ell}|} = \lambda |T_{c\ell}| \log(p)$ with $\lambda > 3$ ensures that the sum is finite.
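To see why such a choice is enough, note that the Catalan count satisfies $N(k) \le 4^{k-1}$, so with $x_k = \lambda k \log(p)$ the $k$-th term $p^{k-1} N(k) e^{-x_k}$ is at most $(4p^{1-\lambda})^k / (4p)$, and $4p^{1-\lambda} < 1$ as soon as $p \ge 2$ and $\lambda > 3$. The sketch below checks this numerically. It assumes the count $p^{k-1} N(k)$ of Lemma 2 with $N(k) = \binom{2k-2}{k-1}/k$; the value $\lambda = 3.5$ and the function names are illustrative assumptions.

```python
import math


def log_catalan(k):
    """log N(k) with N(k) = binom(2k-2, k-1) / k (shapes with k leaves)."""
    return math.log(math.comb(2 * k - 2, k - 1)) - math.log(k)


def weight_sum(p, lam, k_max):
    """Partial sum over k <= k_max of p^(k-1) * N(k) * exp(-lam * k * log p)."""
    total = 0.0
    for k in range(1, k_max + 1):
        # work on the log scale to avoid overflow in p^(k-1) * N(k)
        log_term = (k - 1) * math.log(p) + log_catalan(k) - lam * k * math.log(p)
        total += math.exp(log_term)
    return total


for p in (2, 10, 100):
    s_short = weight_sum(p, lam=3.5, k_max=200)
    s_long = weight_sum(p, lam=3.5, k_max=400)
    assert abs(s_long - s_short) < 1e-12  # the tail is negligible
```

The partial sums stabilise quickly, which is consistent with a geometric tail of ratio $4p^{1-\lambda}$.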
Hence, 



1 — ^ V ~ ^nJ n 



n J \ n J \ n 



+ ^ ^(4if2(-) +4if 



1 — ^ V ^n/ n 



for a proper choice of constants C^, C", and C"'. This leads to 
Hr , /) < ^^inf (P„7., - P„/* + penic I)) + ^j±^ ^AK, (^)^ + AK^ 

Since $P_n \hat f_{c\ell} - P_n f^* \le P_n \bar f_{c\ell} - P_n f^*$ (by definition of $\hat f_{c\ell}$), this last expression can
be upper bounded (with probability larger than $1 - 3\Sigma e^{-x}$) thanks to equation
(5.5) of Lemma 5.3.1:

(l + a) ^4^, CE-)"^ , 



2 

^ 2('l +: 



1 _ 5a > - 



< C[M (P/,, - P/* + penic £)) + C'J(^^)~' + 



The last inequality corresponds to the first equation of Proposition 2. The risk 
bound follows by integration. 